71 research outputs found

In Memoriam: Karen Sparck Jones

Author: B. Grosz
J. Allan
K. Spärck
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones
K. Spärck Jones Synonymy
Peter Willett
Stephen Robertson
Publication venue: 'SAGE Publications'
Publication date: 20/08/2007
Field of study

Crossref

White Rose Research Online

T$^2$K$^2$: The Twitter Top-K Keywords Benchmark

Author: A Guille
AE Gattiker
CD Manning
D Kılınç
DD Lewis
F Ravat
J Darmont
J Ferrarons
J Gray
J O’Shea
JD Cooper
K Spärck Jones
K Spärck Jones
L Wang
S Bringay
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/09/2017
Field of study

Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are casually used for various purposes, are often computed on-the-fly, and thus must be efficiently computed. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T$^2$K$^2$, which features a real tweet dataset and queries with various complexities and selectivities. T$^2$K$^2$ helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T$^2$K$^2$'s relevance and genericity, we successfully performed tests on the TF-IDF and Okapi BM25 weighting schemes, on one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
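As a rough illustration of what a top-k keywords computation involves, the sketch below ranks terms by summed TF-IDF weight over an in-memory list of tokenized documents. It is not the benchmark's implementation: the function name `top_k_keywords`, the textbook `tf * log(N / df)` weighting, and the alphabetical tie-breaking are all assumptions made for this example.

```python
import math
from collections import Counter


def top_k_keywords(docs, k):
    """Return the k terms with the highest summed TF-IDF weight.

    docs is a list of token lists. The weighting is the textbook
    tf * log(N / df) form, chosen here as an illustrative assumption.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue  # an empty document contributes nothing
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])
            scores[term] += tf * idf
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

A term that occurs in every document gets an inverse document frequency of zero under this weighting, so it falls to the bottom of the ranking however often it appears.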

arXiv.org e-Print Archive

Crossref

HAL

Hal-Diderot

Automatic identification methods on a corpus of twenty five fine-grained Arabic dialects

Author: J Li
JC Watson
K Spärck Jones
OF Zaidan
S Hochreiter
S Kullback
S Malmasi
Subarno Pal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 04/10/2019
Field of study

This research deals with Arabic dialect identification, a challenging issue related to Arabic NLP. Indeed, the increasing use of Arabic dialects in written form, especially in social media, generates new needs in the area of Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several approaches based on machine learning techniques. We used a classification method based on the symmetric Kullback-Leibler divergence, and we experimented with classical classification methods such as Naive Bayes classifiers and with more sophisticated methods such as Word2Vec and Long Short-Term Memory neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to MSA.
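One plausible reading of a classifier based on the symmetric Kullback-Leibler divergence is sketched below: each dialect is represented by a smoothed unigram distribution, and an input is assigned to the dialect whose distribution is closest. The abstract does not give the actual features or smoothing, so the unigram model, the additive smoothing parameter `alpha`, and all function names are assumptions.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def symmetric_kl(p, q):
    """KL(p || q) + KL(q || p) for two distributions on the same keys."""
    # The two directed divergences combine into sum (p - q) * log(p / q).
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def classify(tokens, class_corpora):
    """Return the label whose corpus distribution is closest to the input."""
    vocabulary = set(tokens)
    for corpus in class_corpora.values():
        vocabulary.update(corpus)
    vocabulary = sorted(vocabulary)
    p = unigram_distribution(tokens, vocabulary)
    distances = {
        label: symmetric_kl(p, unigram_distribution(corpus, vocabulary))
        for label, corpus in class_corpora.items()
    }
    return min(distances, key=distances.get)
```

Smoothing keeps every probability strictly positive, which the logarithm requires; without it, a word unseen in one corpus would make the divergence undefined.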

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Discovery of Novel Term Associations in a Document Collection

Author: G. Salton
H.P. Luhn
I. Petrič
Jim Cowie
K. Spärck Jones
M. Segond
M.F. Porter
R.L. Cilibrasi
S. Deerwester
Satanjeev Banerjee
T. Kötter
Publication venue: Springer-Verlag
Publication date: 01/01/2012
Field of study

Non peer reviewed

Crossref

Springer - Publisher Connector

Helsingin yliopiston digitaalinen arkisto

An Arabic Corpus of Fake News: Collection, Analysis and Classification

Author: A Zubiaga
D Lazer
JR Quinlan
K Spärck Jones
N Chomsky
R Procter
V Vapnik
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/10/2019
Field of study

Over the last years, with the explosive growth of social media, huge amounts of rumors have spread rapidly on the internet. Indeed, the proliferation of malicious misinformation and nasty rumors in social media can have harmful effects on individuals and society. In this paper, we investigate the content of fake news in the Arab world through information posted on YouTube. Our contribution is threefold. First, we introduce a novel Arabic corpus for the task of fake news analysis, covering the topics most affected by rumors. We describe the corpus and the data collection process in detail. Second, we present several exploratory analyses of the harvested data in order to retrieve useful knowledge about the transmission of rumors for the studied topics. Third, we test the possibility of discriminating between rumor and non-rumor comments using three machine learning classifiers, namely Support Vector Machine (SVM), Decision Tree (DT) and Multinomial Naïve Bayes (MNB).

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Evaluation of a Bayesian inference network for ligand-based virtual screening

Author: A Abdo
A Bender
AG Maldonado
AN Jain
AR Leach
AR Leach
Beining Chen
Christoph Mueller
CX Zhai
D Metzler
EJ Gardiner
EM Voorhees
G Salton
GW Bemis
H Eckert
H Turtle
J Bajorath
J Hert
J Hert
J-F Truchon
JA Grant
JD Holliday
JP Callan
JP Callan
JR Fischer
K Spärck Jones
K Spärck Jones
N Nikolova
P Prathipati
P Willett
P Willett
P Willett
P Willett
P Willett
Peter Willett
RC Glen
RD Brown
RP Sheridan
RP Sheridan
S Siegel
SJ Edgar
T Lengauer
T Strohman
TI Oprea
WR Greiff
X Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

Background: Bayesian inference networks enable the computation of the probability that an event will occur. They have been used previously to rank textual documents in order of decreasing relevance to a user-defined query. Here, we modify the approach to enable a Bayesian inference network to be used for chemical similarity searching, where a database is ranked in order of decreasing probability of bioactivity. Results: Bayesian inference networks were implemented using two different types of network and four different types of belief function. Experiments with the MDDR and WOMBAT databases show that a Bayesian inference network can be used to provide effective ligand-based screening, especially when the active molecules being sought have a high degree of structural homogeneity; in such cases, the network substantially out-performs a conventional, Tanimoto-based similarity searching system. However, the effectiveness of the network is much less when structurally heterogeneous sets of actives are being sought. Conclusion: A Bayesian inference network provides an interesting alternative to existing tools for ligand-based virtual screening.
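For orientation, the conventional Tanimoto-based baseline mentioned above can be sketched as follows. This is a generic illustration, not the system evaluated in the paper: fingerprints are assumed to be Python sets of on-bit positions, and the names `tanimoto` and `rank_by_similarity` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def rank_by_similarity(query, database):
    """Return database keys sorted by decreasing similarity to the query."""
    return sorted(
        database,
        key=lambda name: tanimoto(query, database[name]),
        reverse=True,
    )
```

Ranking a database this way puts the molecules that share the most fingerprint bits with the query first, which is the sense in which a similarity search orders candidates.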

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

White Rose Research Online

Towards a Better Understanding of the Relationship between Probabilistic Models in IR

Author: C. Zhai
C. Zhai
C. Zhai
C.D. Manning
D.W. Hosmer
F. Crestani
J. Lafferty
J.M. Ponte
K. Spärck-Jones
N. Fuhr
R.W.P. Luk
S.E. Robertson
S.E. Robertson
S.E. Robertson
S.E. Robertson
T. Roelleke
T. Roelleke
V. Lavrenko
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that given the differences between PR models and language models, the second gap cannot be bridged at the probabilistic model level. We instead define a new PR model based on logistic regression, whose score function is similar to that of the query likelihood model. The performance of both models is strongly correlated, hence providing a bridge for the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens ample new research directions, which we propose as future work.

Crossref

Ghent University Academic Bibliography

Characterizing eve: Analysing cybercrime actors in a large underground forum

Author: A Field
A Hutchings
A Hutchings
AK Sood
DM Blei
EH Sutherland
GB Vold
K Spärck-Jones
M Karami
MP Marcus
R Anderson
RL Thorndike
S Lloyd
TJ Holt
V Garg
W Chang
X Zhang
Publication venue: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publication date: 01/01/2018
Field of study

Underground forums contain many thousands of active users, but the vast majority will be involved, at most, in minor levels of deviance. The number who engage in serious criminal activity is small. That being said, underground forums have played a significant role in several recent high-profile cybercrime activities. In this work we apply data science approaches to understand criminal pathways and characterize key actors related to illegal activity in one of the largest and longest-running underground forums. We combine the results of a logistic regression model with k-means clustering and social network analysis, verifying the findings using topic analysis. We identify variables relating to forum activity that predict the likelihood a user will become an actor of interest to law enforcement, and would therefore benefit the most from intervention. This work provides the first step towards identifying ways to deter young people from a career in cybercrime. Alan Turing Institute

Crossref

Universidad Carlos III de Madrid e-Archivo

Apollo (Cambridge)

On the role of novelty for search result diversification

Author: C. D. Manning
Craig Macdonald
Iadh Ounis
K. Järvelin
K. Spärck Jones
M. D. Gordon
R. Song
Rodrygo L. T. Santos
S. E. Robertson
S. Kirkpatrick
W. Goffman
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

On morphological hierarchical representations for image processing and spatial data clustering

Author: A. Baraldi
A. Rosenfeld
C. Jardine
C. Mattiussi
C. Ronse
C. Zahn
D. Wishart
E. Breen
F. Dias
F. Meyer
F. Meyer
F. Meyer
G. Bertrand
G. Estabrook
G. Matheron
G. Ouzounis
J. Cousty
J. Cousty
J. Cousty
J. Cousty
J. Cousty
J. Gower
J. Kruskal
J. Serra
J. Shi
J.P. Barthélemy
J.P. Benzécri
K. Florek
K. Spärck Jones
L. Gueguen
L. Guigues
L. Guigues
L. Hubert
L. Hubert
L. Hubert
L. Najman
L. Najman
L. Najman
L. Najman
L. Vincent
M. Nagao
M. Nagao
N. Ahuja
N. Jardine
N. Jardine
N. Jardine
O. Morris
P. Arbeláez
P. Felzenszwalb
P. Nacken
P. Salembier
P. Salembier
P. Salembier
P. Sneath
P. Soille
P. Soille
P. Soille
P. Soille
P. Soille
P. Soille
P. Soille
R. Adams
R. Cormack
R. Graham
R. Jones
R. Levillain
R. Marfil
R. Sokal
S. Beucher
S. Horowitz
S. Johnson
S. Zucker
T. Kong
T. Sørensen
W.G. Kropatsch
Z. Wu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

Hierarchical data representations in the context of classification and data clustering were put forward during the fifties. Recently, hierarchical image representations have gained renewed interest for segmentation purposes. In this paper, we briefly survey fundamental results on hierarchical clustering and then detail recent paradigms developed for the hierarchical representation of images in the framework of mathematical morphology: constrained connectivity and ultrametric watersheds. Constrained connectivity can be viewed as a way to constrain an initial hierarchy in such a way that a set of desired constraints are satisfied. The framework of ultrametric watersheds provides a generic scheme for computing any hierarchical connected clustering, in particular when such a hierarchy is constrained. The suitability of this framework for solving practical problems is illustrated with applications in remote sensing.

arXiv.org e-Print Archive

JRC Publications Repository

Crossref